ROCm and HIP: A Detailed 10-Chapter Tutorial: The Memory-Centric Nature of GPU Performance

In GPU acceleration, we have to abandon the "compute-first" mindset. Modern performance is determined by memory management: the orchestration of data allocation, synchronization, and optimization between the host (CPU) and the device (GPU).

1. The Memory-Compute Disparity

While GPU arithmetic throughput (TFLOPS) has grown exponentially, memory bandwidth (GB/s) has grown at a much slower pace. This creates a gap in which the execution units are often "starved", waiting for data to arrive from VRAM. As a result, GPU programming is often memory programming.

2. The Roofline Model

This model visualizes the relationship between arithmetic intensity (FLOPs/byte, on the horizontal axis) and attainable performance (on the vertical axis). Applications typically fall into one of two categories:

Memory-bound: limited by memory bandwidth (the steep, sloped part of the roofline).
Compute-bound: limited by peak TFLOPS (the horizontal ceiling).
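The two categories above can be told apart with a few lines of arithmetic. The following is a minimal host-side C++ sketch of the roofline bound, not a profiling tool. The function names (`arithmetic_intensity`, `attainable_gflops`, `ridge_point`, `is_memory_bound`) are hypothetical, and the peak and bandwidth values you pass in are assumptions to be replaced with the figures for your own device.

```cpp
#include <algorithm>
#include <cassert>

// Arithmetic intensity: FLOPs performed per byte of memory traffic.
double arithmetic_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}

// Roofline bound in GFLOP/s: the lower of the compute ceiling and
// the bandwidth slope (bandwidth in GB/s times intensity in FLOPs/byte).
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double intensity) {
    return std::min(peak_gflops, bandwidth_gbs * intensity);
}

// Ridge point: the intensity at which the slope meets the ceiling.
double ridge_point(double peak_gflops, double bandwidth_gbs) {
    assert(bandwidth_gbs > 0.0);
    return peak_gflops / bandwidth_gbs;
}

// A kernel whose intensity lies left of the ridge point is memory-bound.
bool is_memory_bound(double peak_gflops, double bandwidth_gbs,
                     double intensity) {
    return intensity < ridge_point(peak_gflops, bandwidth_gbs);
}
```

As a usage example, single-precision SAXPY (y = a*x + y) performs 2 FLOPs per element while reading two floats and writing one, so its intensity is 2/12, roughly 0.17 FLOPs/byte. On a hypothetical device with a 20,000 GFLOP/s peak and 1,000 GB/s of bandwidth, the ridge point is 20 FLOPs/byte and the bound for SAXPY is about 167 GFLOP/s, under 1% of peak. The kernel sits firmly on the sloped part of the roofline.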

3. The Data Movement Tax

The main performance bottleneck is rarely the math itself; it is the latency and energy cost of moving a byte across the PCIe bus or out of HBM. High-performance code favors keeping data resident on the device and minimizes transfers between host and device.
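To make the tax concrete, here is a rough host-side C++ cost model that compares copying a buffer to the device and back around every kernel launch with uploading it once and keeping it resident. It is a sketch under simplifying assumptions: each transfer costs a fixed latency plus bytes divided by bandwidth, and there is no overlap between copies and compute. The function names are hypothetical, and real bandwidth and latency figures should be measured on your own system. In a HIP program, each modeled transfer would correspond to a call such as `hipMemcpy`.

```cpp
#include <cassert>

// Estimated time in seconds for one transfer:
// fixed latency plus payload size divided by link bandwidth.
double transfer_seconds(double bytes, double bandwidth_bytes_per_s,
                        double latency_s) {
    assert(bandwidth_bytes_per_s > 0.0);
    return latency_s + bytes / bandwidth_bytes_per_s;
}

// Naive pattern: upload before and download after every kernel launch.
double round_trip_every_launch(int launches, double bytes,
                               double bandwidth_bytes_per_s,
                               double latency_s) {
    return launches * 2.0 *
           transfer_seconds(bytes, bandwidth_bytes_per_s, latency_s);
}

// Resident pattern: one upload, any number of launches, one download.
double resident_on_device(double bytes, double bandwidth_bytes_per_s,
                          double latency_s) {
    return 2.0 * transfer_seconds(bytes, bandwidth_bytes_per_s, latency_s);
}
```

Under this model the naive pattern pays the transfer cost once per launch, so a pipeline of 100 kernels over the same buffer spends 100 times as long on the bus as the resident version, regardless of how fast the kernels themselves run.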

QUESTION 1

What is the primary cause of a GPU kernel being 'memory-bound'?

The clock speed of the GPU cores is too slow.

The rate of data delivery is slower than the rate of arithmetic execution.

There are too many threads running in parallel.

The CPU is faster than the GPU.

QUESTION 2

In the context of GPU programming, what does 'Memory Management' involve?

Only allocating variables on the CPU stack.

Controlling allocation, synchronization, and optimization of data transfer between host and device.

Optimizing the cache size of the L1 controller.

Manually cleaning the GPU registers after every kernel call.

QUESTION 3

Which axis of the Roofline Model represents 'Arithmetic Intensity'?

Vertical Axis (Y)

Horizontal Axis (X)

The slope of the line.

The area under the curve.

QUESTION 4

Why is redundant host-device transfer considered a 'performance tax'?

It consumes GPU registers.

Latency and energy consumption of moving data across PCIe is significantly higher than instruction execution.

It increases the floating-point precision error.

It causes the GPU to overheat instantly.

QUESTION 5

If a researcher's kernel spends 95% of its time 'stalled,' what is the most likely culprit?

The math instructions are too complex.

Inefficient orchestration of data residence causing the GPU to wait for data.

The GPU has too much VRAM.

The kernel was written in C++ instead of Python.